{
  "cells": [
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "%matplotlib inline"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "\n# Optimizing Vision Transformer Model for Deployment\n\n[Jeff Tang](https://github.com/jeffxtang),\n[Geeta Chauhan](https://github.com/gchauhan/)\n\nVision Transformer models bring the attention-based transformer\narchitecture, which was introduced in Natural Language Processing and\nachieved state-of-the-art (SOTA) results there, to Computer Vision\ntasks. Facebook Data-efficient Image Transformers ([DeiT](https://ai.facebook.com/blog/data-efficient-image-transformers-a-promising-new-technique-for-image-classification))\nis a Vision Transformer model trained on ImageNet for image\nclassification.\n\nIn this tutorial, we first cover what DeiT is and how to use it, then\ngo through the complete steps of scripting, quantizing, and optimizing\nthe model so that it can be used in iOS and Android apps. We also\ncompare the inference speed of the original model with that of the\nscripted, quantized, and optimized versions, and show the effect of\neach step on model size and speed.\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## What is DeiT\n\nConvolutional Neural Networks (CNNs) have been the main models for image\nclassification since deep learning took off in 2012, but CNNs typically\nrequire hundreds of millions of images for training to achieve\nSOTA results. DeiT is a vision transformer model that requires far less\ndata and computing resources for training to compete with the leading\nCNNs in image classification. Two key components of DeiT make this\npossible:\n\n-  Data augmentation that simulates training on a much larger dataset;\n-  Native distillation that allows the transformer network to learn from\n   a CNN\u2019s output.\n\nDeiT shows that Transformers can be successfully applied to computer\nvision tasks with limited access to data and resources. For more\ndetails on DeiT, see the [repo](https://github.com/facebookresearch/deit)\nand [paper](https://arxiv.org/abs/2012.12877).\n\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Classifying Images with DeiT\n\nFollow the README at the DeiT repo for detailed information on how to\nclassify images using DeiT, or for a quick test, first install the\nrequired packages:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# pip install torch torchvision timm pandas requests"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "To run in Google Colab, uncomment the following line:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# !pip install timm pandas requests"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Then run the script below:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from PIL import Image\nimport torch\nimport timm\nimport requests\nimport torchvision.transforms as transforms\nfrom timm.data.constants import IMAGENET_DEFAULT_MEAN, IMAGENET_DEFAULT_STD\n\nprint(torch.__version__)\n# should be 1.8.0\n\n# Load the pretrained DeiT base model and switch it to inference mode\nmodel = torch.hub.load('facebookresearch/deit:main', 'deit_base_patch16_224', pretrained=True)\nmodel.eval()\n\n# ImageNet preprocessing: bicubic resize, center crop, normalize\ntransform = transforms.Compose([\n    transforms.Resize(256, interpolation=transforms.InterpolationMode.BICUBIC),\n    transforms.CenterCrop(224),\n    transforms.ToTensor(),\n    transforms.Normalize(IMAGENET_DEFAULT_MEAN, IMAGENET_DEFAULT_STD),\n])\n\nimg = Image.open(requests.get(\"https://raw.githubusercontent.com/pytorch/ios-demo-app/master/HelloWorld/HelloWorld/HelloWorld/image.png\", stream=True).raw)\nimg = transform(img)[None,]  # add a batch dimension\nout = model(img)\nclsidx = torch.argmax(out)\nprint(clsidx.item())"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The output should be 269, which, according to the ImageNet\n[class index to labels file](https://gist.github.com/yrevar/942d3a0ac09ec9e5eb3a), maps to \u2018timber\nwolf, grey wolf, gray wolf, Canis lupus\u2019.\n\nNow that we have verified that we can use the DeiT model to classify\nimages, let\u2019s see how to modify the model so it can run in iOS and\nAndroid apps.\n\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Scripting DeiT\n\nTo use the model on mobile, we first need to script the\nmodel. See the [Script and Optimize recipe](https://pytorch.org/tutorials/recipes/script_optimized.html) for a\nquick overview. Run the code below to convert the DeiT model used in the\nprevious step to the TorchScript format that can run on mobile.\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "model = torch.hub.load('facebookresearch/deit:main', 'deit_base_patch16_224', pretrained=True)\nmodel.eval()\nscripted_model = torch.jit.script(model)\nscripted_model.save(\"fbdeit_scripted.pt\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "This generates the scripted model file ``fbdeit_scripted.pt``, which is\nabout 346MB in size.\n\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Quantizing DeiT\n\nQuantization can reduce the trained model size significantly while\nkeeping the inference accuracy about the same. Because DeiT is built on\na transformer, we can easily apply dynamic quantization to it: dynamic\nquantization works best for LSTM and transformer models (see [here](https://pytorch.org/docs/stable/quantization.html?highlight=quantization#dynamic-quantization)\nfor more details).\n\nNow run the code below:\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# Use 'fbgemm' for server inference and 'qnnpack' for mobile inference\nbackend = \"fbgemm\"  # switching to 'qnnpack' made the quantized model much slower in this notebook\nmodel.qconfig = torch.quantization.get_default_qconfig(backend)\ntorch.backends.quantized.engine = backend\n\n# Dynamically quantize the Linear layers to int8, then script and save the result\nquantized_model = torch.quantization.quantize_dynamic(model, qconfig_spec={torch.nn.Linear}, dtype=torch.qint8)\nscripted_quantized_model = torch.jit.script(quantized_model)\nscripted_quantized_model.save(\"fbdeit_scripted_quantized.pt\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "This generates the scripted and quantized version of the model,\n``fbdeit_scripted_quantized.pt``, with a size of about 89MB, a 74%\nreduction from the non-quantized model size of 346MB!\n\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "You can use the ``scripted_quantized_model`` to generate the same\ninference result:\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "out = scripted_quantized_model(img)\nclsidx = torch.argmax(out)\nprint(clsidx.item())\n# The same output 269 should be printed"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Optimizing DeiT\n\nThe final step before using the quantized and scripted\nmodel on mobile is to optimize it:\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from torch.utils.mobile_optimizer import optimize_for_mobile\noptimized_scripted_quantized_model = optimize_for_mobile(scripted_quantized_model)\noptimized_scripted_quantized_model.save(\"fbdeit_optimized_scripted_quantized.pt\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The generated fbdeit_optimized_scripted_quantized.pt file has about the\nsame size as the quantized, scripted, but non-optimized model. The\ninference result remains the same.\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "out = optimized_scripted_quantized_model(img)\nclsidx = torch.argmax(out)\nprint(clsidx.item())\n# Again, the same output 269 should be printed"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Using Lite Interpreter\n\nTo see how much the Lite Interpreter reduces model size and speeds up\ninference, let\u2019s create the lite version of the model.\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "optimized_scripted_quantized_model._save_for_lite_interpreter(\"fbdeit_optimized_scripted_quantized_lite.ptl\")\nptl = torch.jit.load(\"fbdeit_optimized_scripted_quantized_lite.ptl\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Although the lite model is comparable in size to the non-lite version,\nthe lite version is expected to run inference faster on mobile.\n\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Comparing Inference Speed\n\nTo see how the inference speed differs for the five models (the\noriginal model, the scripted model, the quantized-and-scripted model,\nthe optimized-quantized-and-scripted model, and the lite model), run\nthe code below:\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "with torch.autograd.profiler.profile(use_cuda=False) as prof1:\n    out = model(img)\nwith torch.autograd.profiler.profile(use_cuda=False) as prof2:\n    out = scripted_model(img)\nwith torch.autograd.profiler.profile(use_cuda=False) as prof3:\n    out = scripted_quantized_model(img)\nwith torch.autograd.profiler.profile(use_cuda=False) as prof4:\n    out = optimized_scripted_quantized_model(img)\nwith torch.autograd.profiler.profile(use_cuda=False) as prof5:\n    out = ptl(img)\n\nprint(\"original model: {:.2f}ms\".format(prof1.self_cpu_time_total/1000))\nprint(\"scripted model: {:.2f}ms\".format(prof2.self_cpu_time_total/1000))\nprint(\"scripted & quantized model: {:.2f}ms\".format(prof3.self_cpu_time_total/1000))\nprint(\"scripted & quantized & optimized model: {:.2f}ms\".format(prof4.self_cpu_time_total/1000))\nprint(\"lite model: {:.2f}ms\".format(prof5.self_cpu_time_total/1000))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The results from a run on Google Colab are:\n\n    original model: 1236.69ms\n    scripted model: 1226.72ms\n    scripted & quantized model: 593.19ms\n    scripted & quantized & optimized model: 598.01ms\n    lite model: 600.72ms\n\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The following results summarize the inference time taken by each model\nand the percentage reduction of each model relative to the original\nmodel.\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import pandas as pd\nimport numpy as np\n\ndf = pd.DataFrame({'Model': ['original model','scripted model', 'scripted & quantized model', 'scripted & quantized & optimized model', 'lite model']})\ndf = pd.concat([df, pd.DataFrame([\n    [\"{:.2f}ms\".format(prof1.self_cpu_time_total/1000), \"0%\"],\n    [\"{:.2f}ms\".format(prof2.self_cpu_time_total/1000),\n     \"{:.2f}%\".format((prof1.self_cpu_time_total-prof2.self_cpu_time_total)/prof1.self_cpu_time_total*100)],\n    [\"{:.2f}ms\".format(prof3.self_cpu_time_total/1000),\n     \"{:.2f}%\".format((prof1.self_cpu_time_total-prof3.self_cpu_time_total)/prof1.self_cpu_time_total*100)],\n    [\"{:.2f}ms\".format(prof4.self_cpu_time_total/1000),\n     \"{:.2f}%\".format((prof1.self_cpu_time_total-prof4.self_cpu_time_total)/prof1.self_cpu_time_total*100)],\n    [\"{:.2f}ms\".format(prof5.self_cpu_time_total/1000),\n     \"{:.2f}%\".format((prof1.self_cpu_time_total-prof5.self_cpu_time_total)/prof1.self_cpu_time_total*100)]],\n    columns=['Inference Time', 'Reduction'])], axis=1)\n\nprint(df)\n\n\"\"\"\n        Model                             Inference Time    Reduction\n0\toriginal model                             1236.69ms           0%\n1\tscripted model                             1226.72ms        0.81%\n2\tscripted & quantized model                  593.19ms       52.03%\n3\tscripted & quantized & optimized model      598.01ms       51.64%\n4\tlite model                                  600.72ms       51.43%\n\"\"\""
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Learn More\n\n- [Facebook Data-efficient Image Transformers](https://ai.facebook.com/blog/data-efficient-image-transformers-a-promising-new-technique-for-image-classification)\n- [Vision Transformer with ImageNet and MNIST on iOS](https://github.com/pytorch/ios-demo-app/tree/master/ViT4MNIST)\n- [Vision Transformer with ImageNet and MNIST on Android](https://github.com/pytorch/android-demo-app/tree/master/ViT4MNIST)\n\n"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.10.4"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}